Abstract: We consider the problem of classification of an object given 
multiple observations that possibly include different transformations. The 
possible transformations of the object generally span a low-dimensional 
manifold in the original signal space. We propose to take advantage of this 
manifold structure for the effective classification of the object represented 
by the observation set. In particular, we design a low complexity solution 
that is able to exploit the properties of the data manifolds with a 
graph-based algorithm. Hence, we formulate the computation of the 
unknown label matrix as a smoothing process on the manifold under the 
constraint that all observations represent an object of one single class. This 
results in a discrete optimization problem, which can be solved by an 
efficient and low complexity algorithm. We demonstrate the performance 
of the proposed graph-based algorithm in the classification of sets of 
multiple images. Moreover, we show its high potential in video-based 
face recognition, where it outperforms state-of-the-art solutions that fall 
short of exploiting the manifold structure of the face image data sets. 

Index Terms: Graph-based classification, multiple observation sets, 
video face recognition, multi-view object recognition. 



I. Introduction 

Recent years have witnessed a dramatic growth of the amount of 
digital data that is produced by sensors or computers of all sorts. That 
creates the need for efficient processing and analysis algorithms in 
order to extract the relevant information contained in these datasets. 
In particular, it commonly happens that multiple observations of 
an object are captured at different time instants or under different 
geometric transformations. For instance, a moving object may be 
observed over a time interval by a surveillance camera (see Fig. 1(a)) 
or under different viewing angles by a network of vision sensors (see 
Fig. 1(b)). This typically produces a large volume of multimedia 
content that lends itself as a valuable source of information for 
effective knowledge discovery and content analysis. In this context, 
classification methods should be able to exploit the diversity of 
the multiple observations in order to provide increased classification 
accuracy [1]. 

We build on our previous work [2] and we focus here on the pattern 
classification problem with multiple observations. We further assume 
that observations are produced from the same object under different 
transformations, so that they all lie on the same low-dimensional 
manifold. We propose a novel graph-based algorithm built on label 
propagation [3]. Label propagation methods typically assume that the 
data lie on a low dimensional manifold living in a high dimensional 
space. They rely upon the smoothness assumption, which states that 
if two data samples $x_1$ and $x_2$ are close, then their labels $y_1$ and 
$y_2$ should be close as well. The main idea of these methods is to 
build a graph that captures the geometry of this manifold as well as 
the proximity of the data samples. The labels of the test examples 
are derived by "propagating" the labels of the labelled data along the 
manifold, while making use of the smoothness property. We exploit 
the specificities of our particular classification problem and constrain 

This work has been mostly performed while the first author was with the 
Signal Processing Laboratory (LTS4) of EPFL. It has been partly supported 
by the Swiss National Science Foundation, under grant NCCR IM2. 
Fig. 1. Typical scenarios of producing multiple observations of an object: 
(a) video frames of a moving object; (b) network of vision sensors. 



the unknown labels to correspond to one single class. This leads to the 
formulation of a discrete optimization problem that can be optimally 
solved by a simple and low complexity algorithm. 

We apply the proposed algorithm to the classification of sets 
of multiple images in handwritten digit recognition, multi-view 
object recognition or video-based face recognition. In particular, we 
show the high potential of our graph-based method for efficient 
classification of images that belong to the same data manifold. For 
example, the proposed solution outperforms state-of-the-art subspace 
or statistical classification methods in video-based face recognition 
and object recognition from multiple image sets. Hence, this paper 
establishes new connections between graph-based algorithms and the 
problems of classification of multiple image sets or video-based 
face recognition, where the proposed solutions are certainly very 
promising. 

The paper is organized as follows. We first formulate the problem 
of classification of multiple observation sets in Section II. We 
introduce our graph-based algorithm inspired by label propagation 
in Section III. Then we demonstrate the performance of the proposed 
classification method for handwritten digit recognition, object 
recognition and video-based face recognition in Sections IV-A, IV-B 
and V, respectively. 

II. Problem definition 

We address the problem of the classification of multiple observations 
of the same object, possibly with some transformations. In 
Fig. 2. Typical structure of the k-NN graph. $\mathcal{N}_i$ represents the neighborhood 
of the sample $x_i$. Filled circles (•) denote labelled examples and empty 
circles (o) denote unlabelled examples. 



particular, the problem is to assign multiple observations of the test 
pattern/object s to a single class of objects. We assume that we have 
m transformed observations of s of the following form 

$$x_i = U(\eta_i)\, s, \quad i = 1, \dots, m,$$

where $U(\eta)$ denotes a (geometric) transformation operator with 
parameters $\eta$, which is applied on $s$. For instance, in the case of visual 
objects, $U(\eta)$ may correspond to a rotation, scaling, translation, or 
perspective projection of the object. We assume that each observation 
$x_i$ is obtained by applying a transformation $\eta_i$ on $s$, which is different 
from its peers (i.e., $\eta_i \neq \eta_j$ for $i \neq j$). The problem is to classify 
$s$ in one of the $c$ classes under consideration, using the multiple 
observations $x_i$, $i = 1, \dots, m$. 

Assume further that the data set is organized in two parts 
$X = \{X^{(l)}, X^{(u)}\}$, where $X^{(l)} = \{x_1, x_2, \dots, x_l\} \subset \mathbb{R}^d$ and 
$X^{(u)} = \{x_{l+1}, \dots, x_n\} \subset \mathbb{R}^d$, where $n = l + m$. Let also 
$\mathcal{L} = \{1, \dots, c\}$ denote the label set. The $l$ examples in $X^{(l)}$ are 
labelled $\{y_1, y_2, \dots, y_l\}$, $y_i \in \mathcal{L}$, and the $m$ examples in $X^{(u)}$ are 
unlabelled. The classification problem can be formally defined as 
follows. 

Problem 1: Given a set of labelled data $X^{(l)}$, and a set of 
unlabelled data $X^{(u)} = \{x_j = U(\eta_j) s,\ j = 1, \dots, m\}$ that 
correspond to multiple transformed observations of $s$, the problem 
is to predict the correct class $c^*$ of the original pattern $s$. 



One may view Problem 1 as a special case of semi-supervised 
learning [4], where the unlabelled data $X^{(u)}$ represent the multiple 
observations with the extra constraint that all unlabelled data examples 
belong to the same (unknown) class. The problem then resides in 
estimating the single unknown class, while generic semi-supervised 
learning problems attribute the test examples to different classes. 

III. Graph-based classification 

A. Label propagation 

We propose in this section a novel method to solve Problem 1, 
which is inspired by label propagation [3]. The label propagation 
algorithm is based on a smoothness assumption, which states that 
if $x_1$ and $x_2$ are close by, then their corresponding labels $y_1$ and 
$y_2$ should be close as well. Denote by $\mathcal{M}$ the set of matrices with 
nonnegative entries, of size $n \times c$. Notice that any matrix $M \in \mathcal{M}$ 
provides a labelling of the data set by applying the following rule: 
$y_i = \arg\max_{j=1,\dots,c} M_{ij}$. We denote the initial label matrix as $Y \in \mathcal{M}$, 
where $Y_{ij} = 1$ if $x_i$ belongs to class $j$ and $0$ otherwise. The 
label propagation algorithm first forms the k nearest neighbor (k-NN) 
graph defined as 

$$\mathcal{G} = (\mathcal{V}, \mathcal{E}),$$

where the vertices $\mathcal{V}$ correspond to the data samples $X$. An edge 
$e_{ij} \in \mathcal{E}$ is drawn if and only if $x_j$ is among the k nearest neighbors 
of $x_i$. 

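It is common practice to assign weights on the edge set of $\mathcal{G}$. One 
typical choice is the Gaussian weights 

$$H_{ij} = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right) & \text{when } (i,j) \in \mathcal{E}, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

The similarity matrix $S \in \mathbb{R}^{n \times n}$ is further defined as 

$$S = D^{-1/2} H D^{-1/2}, \tag{2}$$

where $D$ is a diagonal matrix with entries $D_{ii} = \sum_{j=1}^{n} H_{ij}$. See 
also Fig. 2 for a schematic illustration of the k-NN graph and related 
notation. 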
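
The graph construction described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `knn_similarity` is hypothetical, the Gaussian weight is assumed to use the $2\sigma^2$ scaling, and the edge set is symmetrized (an edge is kept if either endpoint lists the other among its k nearest neighbors), which is an assumption made so that $S$ is symmetric.

```python
import math


def knn_similarity(points, k, sigma):
    """Return S = D^{-1/2} H D^{-1/2} for a symmetrized k-NN graph.

    points: list of equal-length numeric tuples (the samples x_i).
    k: number of nearest neighbours per sample.
    sigma: width of the Gaussian weights (2*sigma^2 scaling is assumed).
    """
    n = len(points)
    # Pairwise squared Euclidean distances.
    d2 = [[sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
           for j in range(n)] for i in range(n)]

    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the k nearest neighbours of x_i, excluding x_i itself.
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: d2[i][j])[:k]
        for j in neighbours:
            w = math.exp(-d2[i][j] / (2.0 * sigma ** 2))
            H[i][j] = w
            H[j][i] = w  # symmetrization: an assumption of this sketch

    # Degrees D_ii = sum_j H_ij, then the symmetric normalization.
    deg = [sum(row) for row in H]
    return [[H[i][j] / math.sqrt(deg[i] * deg[j]) if H[i][j] > 0.0 else 0.0
             for j in range(n)] for i in range(n)]
```

The dense distance table makes the quadratic cost in the number of samples explicit; a spatial index could reduce it in practice, but that goes beyond what the text describes.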

Next, the algorithm computes a real valued $M^* \in \mathcal{M}$ based 
on which the final classification is performed using the rule 
$y_i = \arg\max_{j=1,\dots,c} M^*_{ij}$. This is done via a regularization framework with 
a cost function defined as 

$$Q(M) = \frac{1}{2} \left( \sum_{i,j=1}^{n} H_{ij} \left\| \frac{M_i}{\sqrt{D_{ii}}} - \frac{M_j}{\sqrt{D_{jj}}} \right\|^2 + \mu \sum_{i=1}^{n} \|M_i - Y_i\|^2 \right), \tag{3}$$

where $M_i$ denotes the ith row of $M$. The computation of $M^*$ 
is done by solving the quadratic optimization problem 
$M^* = \arg\min_{M \in \mathcal{M}} Q(M)$. 

Intuitively, we are seeking an $M^*$ that is smooth along the edges of 
similar pairs $(x_i, x_j)$ and at the same time close to $Y$ when evaluated 
on the labelled data $X^{(l)}$. The first term in (3) is the smoothness term 
and the second is the fitness term. 

Notice that when two examples $x_i$ and $x_j$ are similar (i.e., the 
weight $H_{ij}$ is large) minimizing the smoothness term in (3) results 
in $M$ being smooth across similar examples. Thus, similar data 
examples will likely share the same class label. It can be shown 
[3] that the solution to problem (3) is given by 

$$M^* = \beta (I - \alpha S)^{-1} Y, \tag{4}$$

where $\alpha = \frac{1}{1+\mu}$ and $\beta = \frac{\mu}{1+\mu}$. 

Several other variants of label propagation have been 
proposed in the past few years. We mention, for instance, the method 
of [5] and the variant of label propagation that was inspired by the 
Jacobi iteration algorithm [4, Ch. 11]. It is also interesting to note 
that connections have been found to Markov random walks 
[6] and electric networks [7]. Note finally that label propagation is 
probably the most representative algorithm among the graph-based 
methods for semi-supervised learning. 
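
As an illustration of how label propagation can be run without forming a matrix inverse, the sketch below uses the fixed-point iteration $M \leftarrow \alpha S M + \beta Y$ with $\beta = 1 - \alpha$. When $0 < \alpha < 1$ and the eigenvalues of $S$ lie in $[-1, 1]$, its limit is $\beta (I - \alpha S)^{-1} Y$. The function names, the default value of `alpha` and the iteration count are assumptions made for this example, and the majority vote mirrors the way a single class can be extracted from per-sample labels.

```python
from collections import Counter


def propagate_labels(S, Y, alpha=0.9, iters=200):
    """Iterate M <- alpha * S * M + (1 - alpha) * Y, starting from M = Y."""
    n, c = len(Y), len(Y[0])
    beta = 1.0 - alpha
    M = [list(row) for row in Y]
    for _ in range(iters):
        M = [[alpha * sum(S[i][k] * M[k][j] for k in range(n)) + beta * Y[i][j]
              for j in range(c)] for i in range(n)]
    return M


def labels_from(M):
    """Per-sample label: index of the largest entry in each row of M."""
    return [max(range(len(row)), key=row.__getitem__) for row in M]


def majority_vote(labels):
    """Single class decision from the per-sample labels of an observation set."""
    return Counter(labels).most_common(1)[0][0]
```

For example, on a four-node path graph whose end nodes are labelled with classes 0 and 1, the two interior nodes take the label of the nearer end.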

B. Label propagation with multiple observations 

We propose now to build on graph-based algorithms to solve the 
problem of classification of multiple observation sets. In general, 
label propagation assumes that the unlabelled examples come from 
different classes. As Problem 1 presents the specific constraint that all 
unlabelled data belong to the same class, label propagation does not 
fit exactly the definition of the problem as it falls short of exploiting 
its special structure. Therefore, we propose in the sequel a novel 
graph-based algorithm, which (i) uses the smoothness criterion on 
Fig. 3. Structure of the class-conditional label matrix $Z_p$. 



the manifold in order to predict the unknown class labels and (ii) at 
the same time, it is able to exploit the specificities of Problem 1. 

We represent the data labels with a 1-of-c encoding, which allows us 
to form a binary label matrix of size $n \times c$, whose ith row encodes the 
class label of the ith example. The class label is encoded 
in the position of the nonzero element. 

Suppose now that the correct class for the unlabelled data is the pth 
one. In this case, we denote by $Z_p \in \mathbb{R}^{n \times c}$ the corresponding label 
matrix. Note that there are $c$ such label matrices; one for each class 
hypothesis. Each class-conditional label matrix $Z_p$ has the following 
form 

$$Z_p = \begin{bmatrix} Y_l \\ \mathbf{1} e_p^\top \end{bmatrix}, \quad Y_l \in \mathbb{R}^{l \times c}, \quad \mathbf{1} e_p^\top \in \mathbb{R}^{m \times c}, \tag{5}$$

where $e_p \in \mathbb{R}^{c}$ is the pth canonical basis vector and $\mathbf{1} \in \mathbb{R}^{m}$ is the 
vector of ones. Fig. 3 shows schematically the structure of matrix 
$Z_p$. The upper part corresponds to the labelled examples and the 
lower part to the unlabelled ones. $Z_p$ holds the labels of all data 
samples, assuming that all unlabelled examples belong to the pth 
class. Observe that the $Z_p$'s share the first part $Y_l$ and differ only in 
the second part. 
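
To make the block structure concrete, the following sketch stacks the labelled rows on top of $m$ copies of the pth canonical basis vector. The function name `class_conditional_matrix` is hypothetical, and classes are indexed from zero here, which differs from the one-based indexing of the text.

```python
def class_conditional_matrix(Y_l, m, p):
    """Stack the labelled one-hot rows Y_l on top of m copies of e_p.

    Y_l: list of l one-hot rows of length c (labels of the labelled samples).
    m: number of unlabelled observations.
    p: hypothesised class of all unlabelled observations (zero-based).
    """
    c = len(Y_l[0])
    e_p = [1.0 if j == p else 0.0 for j in range(c)]
    upper = [list(row) for row in Y_l]        # shared by every hypothesis
    lower = [list(e_p) for _ in range(m)]     # the only part that depends on p
    return upper + lower
```

With two labelled samples, three observations and `p = 2`, the result has five rows, of which the last three all equal `[0, 0, 1]`.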

Since all unlabelled examples share the same label, the class labels 
have a special structure that reflects the special structure of Problem 
1, as outlined in our previous work [2]. We could then express the 
unknown label matrix $M$ as 

$$M = \sum_{p=1}^{c} \lambda_p Z_p, \quad Z_p \in \mathbb{R}^{n \times c}, \tag{6}$$

where $Z_p$ is given in (5), $\lambda_p \in \{0, 1\}$ and 

$$\sum_{p=1}^{c} \lambda_p = 1. \tag{7}$$

In the above, $\lambda = [\lambda_1, \dots, \lambda_c]$ is the vector of linear combination 
weights, which are discrete and sum to one. Ideally, $\lambda$ should be 
sparse with only one nonzero entry pointing to the correct class. 

The classification problem now resides in estimating the proper 
value of $\lambda$. We rely on the smoothness assumption and we propose 
the following objective function 

$$Q(M(\lambda)) = \frac{1}{2} \sum_{i,j=1}^{n} H_{ij} \left\| \frac{M_i}{\sqrt{D_{ii}}} - \frac{M_j}{\sqrt{D_{jj}}} \right\|^2, \tag{8}$$

where the optimization variable now becomes the $\lambda$ vector. Notice 
that the fitting term in Eq. (3) is not needed anymore due to 
the structure of the $Z$ matrices. Furthermore, we observe that the 
optimization parameter $\lambda$ is implicitly represented in the above 
equation through $M$, defined in Eq. (6). 

In the above, $M_i$ (resp. $M_j$) denotes the ith (resp. jth) row of 
$M$. In the case of normalized similarity matrix, the above criterion 
becomes 

$$\tilde{Q}(M(\lambda)) = \frac{1}{2} \sum_{i,j=1}^{n} S_{ij} \|M_i - M_j\|^2, \tag{9}$$

where $S$ is defined as in (2). It can be seen that the objective function 
directly relies on the smoothness assumption. When two examples $x_i$, 
$x_j$ are nearby (i.e., $H_{ij}$ or $S_{ij}$ is large), minimizing $Q(\lambda)$ and $\tilde{Q}(\lambda)$ 
results in class labels that are close too. The following proposition 
now shows the explicit dependence of $\tilde{Q}$ on $\lambda$. 

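Proposition 1: Assume the data set is split into $l$ labelled examples 
$X^{(l)}$ and $m$ unlabelled examples $X^{(u)}$, i.e., $X = [X^{(l)}, X^{(u)}]$. Then, 
the objective function (9) can be written in the following form 

$$\tilde{Q}(\lambda) = C + \frac{1}{2} \sum_{i \le l,\, j > l} S_{ij} \|Y_i - \lambda\|^2 + \frac{1}{2} \sum_{i > l,\, j \le l} S_{ij} \|\lambda - Y_j\|^2, \tag{10}$$

where $C = \frac{1}{2} \sum_{i,j \le l} S_{ij} \|Y_i - Y_j\|^2$. 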



Proof: From equation (9) observe that 

$$\tilde{Q}(\lambda) = \underbrace{\frac{1}{2} \sum_{i,j \le l} S_{ij} \|M_i - M_j\|^2}_{Q_1} + \underbrace{\frac{1}{2} \sum_{i,j > l} S_{ij} \|M_i - M_j\|^2}_{Q_2} + \underbrace{\frac{1}{2} \sum_{i \le l,\, j > l} S_{ij} \|M_i - M_j\|^2}_{Q_3} + \underbrace{\frac{1}{2} \sum_{i > l,\, j \le l} S_{ij} \|M_i - M_j\|^2}_{Q_4}.$$

We consider the following cases 

(i) $i \le l$ and $j \le l$: both data examples $x_i$ and $x_j$ are 
labelled. Then, $M_i = (\sum_{p=1}^{c} \lambda_p) Y_i = Y_i$, due to the 
special structure of the $Z$ matrices (see (5)) and also due 
to the constraint from Eq. (7). Similarly, $M_j = Y_j$. This 
results in $Q_1 = \frac{1}{2} \sum_{i,j \le l} S_{ij} \|Y_i - Y_j\|^2 = C$, which is a 
constant term and does not depend on $\lambda$. 

(ii) $i > l$ and $j > l$: both data samples $x_i$ and $x_j$ are unlabelled. 
In this case, $M_i = \lambda$ and $M_j = \lambda$, again due to (5). 
Therefore the second term $Q_2$ is zero. 

(iii) $i \le l$ and $j > l$: $x_i$ is labelled and $x_j$ is unlabelled. In 
this case, $M_i = Y_i$ and $M_j = \lambda$. This results in 
$Q_3 = \frac{1}{2} \sum_{i \le l,\, j > l} S_{ij} \|Y_i - \lambda\|^2$. 

(iv) $i > l$ and $j \le l$ is analogous to the case (iii) above, 
where the roles of $x_i$ and $x_j$ are switched. Thus, 
$Q_4 = \frac{1}{2} \sum_{i > l,\, j \le l} S_{ij} \|\lambda - Y_j\|^2$. 

Putting the above facts together yields Eq. (10). ■ 
The above proposition suggests that only the interface between 
labelled and unlabelled examples matters in determining the smoothness 
value of a candidate label matrix $M$, or equivalently the solution 
vector $\lambda$. We use this observation in order to design an efficient 
graph-based classification algorithm that is described below. 
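
The decomposition in Proposition 1 can be checked numerically on a toy example. The sketch below compares the full smoothness sum over all pairs with the constant term plus the two interface terms; the function names are hypothetical, and `lam` is assumed to be a one-hot list standing for the vector $\lambda$.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two rows."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def full_objective(S, M):
    """0.5 * sum over all pairs (i, j) of S_ij * ||M_i - M_j||^2."""
    n = len(M)
    return 0.5 * sum(S[i][j] * sq_dist(M[i], M[j])
                     for i in range(n) for j in range(n))


def interface_objective(S, Y_l, lam):
    """Constant term C plus the two labelled/unlabelled interface terms."""
    l, n = len(Y_l), len(S)
    const = 0.5 * sum(S[i][j] * sq_dist(Y_l[i], Y_l[j])
                      for i in range(l) for j in range(l))
    q3 = 0.5 * sum(S[i][j] * sq_dist(Y_l[i], lam)
                   for i in range(l) for j in range(l, n))
    q4 = 0.5 * sum(S[i][j] * sq_dist(lam, Y_l[j])
                   for i in range(l, n) for j in range(l))
    return const + q3 + q4
```

Building `M` as the labelled rows followed by copies of `lam` and evaluating both functions gives the same value up to rounding, because the unlabelled/unlabelled block contributes nothing.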






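Algorithm 1 The MASC algorithm 
1: Input: 
   $X \in \mathbb{R}^{d \times n}$: data examples. 
   $m$: number of observations. 
   $l$: number of labelled data. 
2: Output: 
   $\hat{p}$: estimated unknown class. 
3: Initialization: 
4: Form the k-NN graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. 
5: Compute the weight matrix $H \in \mathbb{R}^{n \times n}$ and the diagonal matrix 
   $D$, where $D_{ii} = \sum_{j=1}^{n} H_{ij}$. 
6: Compute $S = D^{-1/2} H D^{-1/2}$. 
7: for $p = 1 : c$ do 
8:    $\lambda = e_p$ 
9:    $q(p) = \sum_{i \le l,\, j > l} S_{ij} \|Y_i - \lambda\|^2 + \sum_{i > l,\, j \le l} S_{ij} \|\lambda - Y_j\|^2$ 
10: end for 
11: $\hat{p} = \arg\min_p q(p)$ 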



C. The MASC algorithm 

We propose in this section a simple, yet effective graph-based 
algorithm for the classification of multiple observations from the same 
class. Based on Proposition 1 and ignoring the constant term, we need 
to solve the following optimization problem 

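Optimization problem: OPT 

$$\min_{\lambda} \; \sum_{i \le l,\, j > l} S_{ij} \|Y_i - \lambda\|^2 + \sum_{i > l,\, j \le l} S_{ij} \|\lambda - Y_j\|^2$$

subject to 

$$\lambda_p \in \{0, 1\}, \quad p = 1, \dots, c,$$

$$\sum_{p=1}^{c} \lambda_p = 1.$$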

Intuitively, we seek the class that corresponds to the smoothest 
label assignment between labelled and unlabelled data. Observe that 
the above problem is a discrete optimization problem due to the 
constraints imposed on $\lambda$, which can be collected in a set $\Lambda$, where 

$$\Lambda = \left\{ \lambda \in \mathbb{R}^{c} : \lambda_p \in \{0, 1\},\ p = 1, \dots, c,\ \sum_{p=1}^{c} \lambda_p = 1 \right\}.$$

Interestingly, the search space $\Lambda$ is small. In particular, it consists of 
the following c vectors: 

[1,0,...,0,...,0] 
[0,1,...,0,...,0] 

[0,0,...,1,...,0] 
[0,0,...,0,...,1]. 

Thus, one may solve OPT by enumerating all the above possible solutions 
and picking the one $\lambda^*$ that minimizes the objective of OPT. Then, the position of 
the nonzero entry in $\lambda^*$ yields the estimated unknown class. We 
call this algorithm MAnifold-based Smoothing under Constraints 
(MASC) and we show its main steps in Algorithm 1. The MASC 
algorithm has a complexity that is linear with the number of classes, 
and quadratic with the number of samples. The construction of the k-NN 
graph (lines 4-6) scales as $O(n^2)$. Once the graph has been 
constructed, the enumeration of all possible solutions scales as $O(c)$. 
We conclude that the total computational cost is $O(n^2 + c)$. 
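
A minimal sketch of the enumeration step is given below, assuming the normalized similarity matrix $S$ has already been computed and that the first $l$ rows of $S$ correspond to the labelled samples. The function name `masc` and the zero-based class indices are choices made for this example; the straightforward loops are written for clarity and do not attempt to reach the complexity stated in the text.

```python
def masc(S, Y_l):
    """Pick the class hypothesis with the smoothest labelled/unlabelled interface.

    S: n x n similarity matrix, labelled samples first.
    Y_l: l one-hot rows of length c for the labelled samples.
    Returns (best_class, scores), with scores[p] the interface cost of class p.
    """
    l, n, c = len(Y_l), len(S), len(Y_l[0])
    scores = []
    for p in range(c):
        lam = [1.0 if j == p else 0.0 for j in range(c)]  # candidate lambda = e_p
        q = 0.0
        for i in range(l):
            d = sum((a - b) ** 2 for a, b in zip(Y_l[i], lam))
            for j in range(l, n):
                # Both interface sums involve the same pair (i, j),
                # once through S_ij and once through S_ji.
                q += (S[i][j] + S[j][i]) * d
        scores.append(q)
    best = min(range(c), key=scores.__getitem__)
    return best, scores
```

Since the labelled rows are one-hot, the squared distance is 0 when a labelled sample belongs to the hypothesised class and 2 otherwise, so each score is proportional to the total edge weight linking the observations to labelled samples of the other classes.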

IV. Classification of multiple image sets 

A. Handwritten digit classification 

We evaluate the performance of the proposed MASC algorithm 
with respect to label propagation, in the context of handwritten 



digit classification. Multiple transformed images of the same digit 
class form a set of observations, which we want to assign to the 
correct class. We use two different data sets for our experimental 
evaluation: (i) a handwritten digit image collection¹ and (ii) the USPS 
handwritten digit image collection. The first collection contains 20 × 16 
binary images of "0" through "9", where each class contains 39 
examples. The USPS collection contains 16 × 16 grayscale images 
of digits and each class contains 1100 examples. 

Robustness to pattern transformations is a very important property 
of the classification of multiple observations. Transformation invariance 
can be built into classification algorithms by augmenting 
the labelled examples with the so-called virtual samples, denoted 
hereby as $X^{(vs)}$ (see [8] for a similar approach). The virtual samples 
are essentially data samples that are generated artificially, by applying 
transformations to the original data samples. They are given the 
class labels of the original examples that they have been generated 
from, and are treated as labelled data. By including the virtual 
samples in the data set, any classification algorithm becomes more 
robust to transformations of the test examples. We therefore adopt 
this strategy in the proposed methods and we include $n_{vs}$ virtual 
samples $X^{(vs)}$ in our original data set that is finally written as 
$X = \{X^{(l)}, X^{(vs)}, X^{(u)}\}$. 
We compare the classification performance of the MASC algorithm 
with the label propagation (LP) method. In LP, the estimated class is 
computed by majority voting on the estimated class labels computed 
in Eq. (4). In our experiments, we use the same k-NN graph in 
combination with the Gaussian weights from Eq. (1) in both LP and 
MASC methods. In order to determine the value of the parameter $\sigma$ 
in Eq. (1) we adopt the following process: we pick randomly 1000 
examples, compute their pairwise distances and then set $\sigma$ equal to 
half of their median. 
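
The bandwidth selection just described could be sketched as follows. The function name `sigma_from_median` and the fixed random seed are assumptions of this example, and the distances are taken to be plain Euclidean distances (not squared), which the text does not specify.

```python
import math
import random
import statistics


def sigma_from_median(points, sample_size=1000, seed=0):
    """Half of the median pairwise Euclidean distance over a random subsample.

    points: list of equal-length numeric tuples.
    sample_size: number of examples drawn at random (all points if fewer).
    """
    rng = random.Random(seed)
    if len(points) > sample_size:
        sample = rng.sample(points, sample_size)
    else:
        sample = list(points)
    # Distances over unordered pairs; the median is the same as over ordered pairs.
    dists = [math.dist(sample[i], sample[j])
             for i in range(len(sample))
             for j in range(i + 1, len(sample))]
    return 0.5 * statistics.median(dists)
```

For three points on a line at 0, 2 and 6, the pairwise distances are 2, 4 and 6, so the function returns 2.0.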

We first split the data sets into training and test sets by including 
2 examples per class in the training set; the remaining examples are 
assigned to the test set. Each training sample is augmented by 4 
virtual examples generated by successive rotations of it, where each 
rotation angle is sampled regularly in [-40°, 40°]. This interval has 
been chosen to be sufficiently small in order to avoid the confusion 
of digits '6' and '9'. Next, in order to build the unlabelled set 
$X^{(u)}$ (i.e., multiple observations) of a certain class, we choose 
randomly a sample from the test set of this class and then we apply 
a rotation on it by a random (uniformly sampled) angle 
in [-40°, 40°]. 

The number of nearest neighbors was set to $k = 5$ for both the binary 
digit collection and the USPS data set, in both methods. This value 
of $k$ was selected as the one giving the best performance of LP on the 
test set. We try different sizes of the unlabelled set (i.e., multiple 
observations), namely m = [10 : 20 : 150] (in MATLAB notation). 
For each value of $m$, we report the average classification error rate 
across 100 random realizations of $X^{(u)}$ generated from each one of 
the 10 classes. Thus, each point in the plot is an average over 1000 
random experiments. 

Figures 4(a) and 4(b) show the results over the binary digits 
and the USPS digits image collections, respectively. Observe first 
that increasing the number of observations gradually improves the 
classification error rate of both methods. This is expected since 
more observations of a certain pattern give more evidence, which 
in turn results in higher confidence in the estimated class label. 
Finally, observe that the proposed MASC algorithm unsurprisingly 
outperforms LP in both data sets, since it is designed to exploit the 
particular structure of Problem 1. 

¹ http://www.cs.toronto.edu/~roweis/data.html 
Fig. 4. Classification results measured on two different data sets: 
(a) binary digits; (b) USPS digits. Horizontal axis: number of observations. 

B. Object recognition from multi-view image sets 

In this section we evaluate our graph-based algorithm in the context 
of object recognition from multi-view image sets. In this case, the 
different views are considered as multiple observations of the same 
object, and the problem is to recognize correctly this object. 

The proposed MASC method implements Gaussian weights (1) and 
sets $k = 5$ in the construction of the k-NN graph. We compare MASC 
to well-known methods from the literature, which mostly gather 
algorithms based on either subspace analysis or density estimation 
(statistical methods): 

• MSM. The Mutual Subspace Method [9], [10], which is the most 
well known representative of the subspace analysis methods. 
It represents each image set by a subspace spanned by the 
principal components, i.e., eigenvectors of the covariance matrix. 
The comparison of a test image set with a training one is then 
achieved by computing the principal angles [11] between the 
two subspaces. In our experiments, the number of principal 
components has been set to nine, which has been found to 
provide the best performance. 

• KMSM. MSM has been extended to its nonlinear version called 
the Kernel Mutual Subspace Method (KMSM) [12], in order 
to take into account the nonlinearity of typical image sets. The 
main difference of KMSM from MSM is that the images are 
first nonlinearly mapped into a high dimensional feature space, 
before modeling by linear subspaces takes place. In other words, 
KMSM uses kernel PCA instead of PCA in order to capture the 
nonlinearities in the data. In KMSM, we use the Gaussian kernel 
$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$, where $\sigma$ is determined exactly in the 
same way as in the Gaussian weights of our MASC method. 

• KLD. The KL-divergence algorithm by Shakhnarovich et al. [13] 
is the most popular representative of density-based statistical 
methods. It formulates the classification from multiple images 
as a statistical hypothesis testing problem. Under the i.i.d. and 
the Gaussian assumptions on the image sets, the classification 
problem typically boils down to a computation of the KL 
divergence between sets, which can be computed in closed form 
in this case. The energy cut-off, which determines the number of 
principal components used in the regularization of the covariance 
matrices, has been set to 0.96. 
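
For reference, the closed form alluded to in the KLD item is the standard expression for the KL divergence between two $d$-dimensional Gaussians $\mathcal{N}(\mu_0, \Sigma_0)$ and $\mathcal{N}(\mu_1, \Sigma_1)$. The symbols $\mu_0$, $\Sigma_0$, $\mu_1$, $\Sigma_1$ and $d$ are notation introduced here, and the text does not state whether [13] uses this directed form or a symmetrized variant.

```latex
D_{\mathrm{KL}}\big(\mathcal{N}(\mu_0, \Sigma_0) \,\|\, \mathcal{N}(\mu_1, \Sigma_1)\big)
  = \frac{1}{2} \left[ \operatorname{tr}\!\big(\Sigma_1^{-1} \Sigma_0\big)
  + (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0)
  - d + \ln \frac{\det \Sigma_1}{\det \Sigma_0} \right]
```

The expression requires invertible covariance matrices, which is consistent with the regularization of the covariance estimates mentioned above.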

In our evaluation, we use the ETH-80 image set [14], which 
contains 80 object classes from 8 categories: apple, car, cow, cup, 
dog, horse, pear and tomato. Each category has 10 object classes 



MASC            MSM             KMSM            KLD
88.88 (1.71)    74.88 (5.02)    83.25 (3.4)     52.5 (3.95)

TABLE I 

Object recognition rate in the mean(std) format, measured on 
the ETH-80 database. 



(see Fig. 5(a)). Each object class then consists of 41 views of the 
object spaced evenly over the upper viewing hemisphere. Figure 5(b) 
shows the 41 views from a sample car object class. We use the 
cropped-close128 part of the database. All provided images 
are of size 128 × 128 and they are cropped, so that they contain only 
the object without any border area. We downsampled the images to 
size 32 × 32 for computational ease. No further preprocessing is done. 

The 41 views from each object class are split randomly into 21 
training and 20 test samples. In this case, the 20 different views 
in the test set correspond to the multiple observations of the test 
object. We perform 10 random experiments where the images are 
randomly split into training and test sets. Table I shows the average 
object recognition rate for each method. We also report the standard 
deviation of each method in parentheses. Notice that the subspace 
methods are superior to the KLD method which assumes Gaussian 
distribution of the data. Notice also that as one would expect, KMSM 
outperforms MSM that falls short of capturing the nonlinearities 
in the data. Finally, observe that our graph-based method clearly 
outperforms its competitors, as it is able to capture not only the 
nonlinearity but also the manifold structure of the data. 

V. Video-based face recognition 

A. Experimental setup 

In this section we evaluate our graph-based algorithm in the context 
of face recognition from video sequences. In this case, the different 
video frames are considered as multiple observations of the same 
person, and the problem consists in the correct classification of 
this person. We evaluate in this section the behavior of the MASC 
algorithm in realistic conditions, i.e., under variations in head pose, 
facial expression and illumination. Note in passing that our algorithm 
does not assume any temporal order between the frames; hence, it 
is also applicable to the generic problem of face recognition from 
image sets. 

We use two publicly available databases: the VidTIMIT [15] and 
the first subset of the Honda/UCSD [16] database. The VidTIMIT 
Fig. 5. Sample images from the ETH-80 database: (a) ETH-80; 
(b) 41 views of a sample car model. 

database² contains 43 individuals, with three face sequences per 
subject, each obtained in a different session. The sessions were 
recorded with a mean delay of seven days 
between sessions one and two, and six days between sessions two and 
three. In each video sequence, each person performed a head rotation 
sequence. In particular, the sequence consists of the person moving 
his/her head to the left, right, back to the center, up, then down and 
finally returning to the center. 

The Honda/UCSD database³ contains 59 sequences of 20 subjects. 
In contrast to the previous database, the individuals move their heads 
freely, at different speeds and with varying facial expressions. In each sequence, 
the subjects perform free in-plane and out-of-plane head rotations. 
Each person has between 2 and 5 video sequences, so the number 
of sequences per subject is variable. 

For preprocessing, in both databases, we first used P. Viola's 
face detector [17] in order to automatically extract the facial region 
from each frame. Note that this typically results in misaligned facial 
images. Next, we downsampled the facial images to size 32 × 32 for 
computational ease. No further preprocessing has been performed, 
which brings our experimental setup closer to real testing conditions. 

B. Classification results on VidTIMIT 

We first study the performance of the MASC algorithm with the 
VidTIMIT database. Figure 6 shows a few representative images 
from a sample face manifold in the VidTIMIT database. Observe 
the presence of large head pose variations. Figure 7 shows the 
3D projection of the manifold that is obtained using the ONPP 
method [18], which has been shown to be an effective tool for 
data visualization. Notice the four clusters corresponding to the four 
different head poses i.e., looking left, right, up and down. This 
indicates that a graph-based method should be able to capture the 
geometry of the manifold and propagate class labels based on the 
manifold structure. 

Since there are three sessions, we use the following metric for 
evaluating the classification performance 

$$\bar{e} = \frac{1}{6} \sum_{i=1}^{3} \sum_{\substack{j=1 \\ j \neq i}}^{3} e(i, j), \tag{11}$$

² http://users.rsise.anu.edu.au/~conrad/vidtimit/ 

³ http://vision.ucsd.edu/leekc/HondaUCSDVideoDatabase/HondaUCSD.html 




Fig. 6. Head pose variations in the VidTIMIT database; panels (a) to (h) 
show poses 1 to 8. 

where $e(i,j)$ is the classification error rate when the ith session is 
used as training set and the jth session is used as test set. In other 
words, $\bar{e}$ is the average classification error rate calculated over the 
following six experiments, namely (1,2), (2,1), (1,3), (3,1), (2,3) and 
(3,2). 
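
The averaging over the six train/test session pairs can be written compactly as below. The function name `average_cross_session_error` and the dictionary layout are assumptions of this sketch, and any error values passed to it are placeholders rather than results from the paper.

```python
from itertools import permutations


def average_cross_session_error(error, sessions=(1, 2, 3)):
    """Mean of error[(i, j)] over all ordered pairs of distinct sessions.

    error: dict mapping (train_session, test_session) to an error rate.
    """
    pairs = list(permutations(sessions, 2))  # (1,2), (1,3), (2,1), (2,3), (3,1), (3,2)
    return sum(error[(i, j)] for i, j in pairs) / len(pairs)
```

With three sessions there are exactly six ordered pairs, which matches the six experiments listed above.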



Recognition rate (%)    MASC     MSM      KMSM     KLD
r = 4                   96.51    91.47    95.74    84.5
r = 8                   96.51    87.21    94.19    81.4
r = 12                  94.96    85.66    92.64    77.52
r = 16                  93.8     81.4     89.15    72.48

TABLE II 

Video face recognition results on the VidTIMIT database. 



We evaluate the video face recognition performance of all methods 
for diverse sizes of the training and test sets. The objective is to assess 
the robustness of the methods with respect to the size of the training 
and test set. For this reason, each image set is re-sampled as 

$$X_{i,r} = X_i(:, 1:r:n), \quad i = 1, \dots, c.$$

In the above, the image set $X_i$ is re-sampled with step $r$, i.e., only one 
image every $r$ images is kept. In our experiments, we use different 
values of $r$ ranging from 4 to 16 with step 4. For each value of $r$, we 
Fig. 7. A typical face manifold from the VidTIMIT database. Observe the 
four clusters corresponding to the four different head poses (face looking left, 
right, up and down). 

Fig. 8. Video face recognition results on the VidTIMIT database. 

Fig. 9. Head pose variations in the Honda/UCSD database. 

Fig. 10. A typical face manifold from the Honda/UCSD database. 



measure the average classification error rate according to the relation 
(11). 
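
The re-sampling with step $r$, written in MATLAB notation in the text, corresponds to a strided slice in Python. In this sketch an image set is assumed to be stored as a list of frames (one entry per image), and the function name `resample` is hypothetical.

```python
def resample(frames, r):
    """Keep one image out of every r, starting from the first one.

    Equivalent to the MATLAB column selection 1:r:n on a list of n frames.
    """
    return frames[::r]
```

For instance, 20 frames numbered 1 to 20 re-sampled with `r = 4` leave frames 1, 5, 9, 13 and 17, so larger values of `r` produce sparser image sets.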

Table II shows the recognition performance, for $r$ ranging from 
4 to 16 with step 4. Figure 8 shows the same results graphically. 
Observe that the KLD method, which relies on density estimation, is 
sensitive to the amount of available data. Also, notice that MSM is 
superior to KLD, which is expected since KLD relies on the imprecise 
assumption that data follow a Gaussian distribution. Furthermore, 
KMSM, the nonlinear variant of MSM, outperforms the latter, which 
has trouble capturing the nonlinear structures in the data. Finally, 
we observe that MASC outperforms its competitors for every 
value of $r$ in Table II. At the same time, it stays robust to significant 
re-sampling of the data: its recognition rate drops only from 96.51% 
to 93.8% as $r$ grows from 4 to 16. 

C. Classification results on Honda/UCSD 

We further study the video-based face recognition performance 
on the Honda/UCSD database. Figure 9 shows a few representative 
images from a sample face manifold in the Honda/UCSD database. 
Observe the presence of large head pose variations along with facial 
expressions. The projection of the manifold onto the 3D space using 
ONPP again clearly shows the manifold structure of the data (see 
Figure 10), which implies that a graph-based method is more suitable 
for this kind of data. 



The Honda/UCSD database comes with a default splitting into 
training and test sets, which contains 20 training and 39 test video 
sequences. We use this default setup and we report the classification 
performance of all methods, under different data re-sampling rates. 
As above, both training and test image sets are re-sampled 
with step r, i.e., X_i^r = X_i(:, 1:r:n), i = 1, ..., c. Table III 
shows the recognition rates when r varies from 4 to 12 with step 
2. Figure 11 shows the same results graphically. Recall that larger 
values of r imply sparser image sets. Observe again that KLD is 
the method most affected by r, suffering a clear loss in performance. 
This is not surprising since it is a density-based method and densities 
cannot, in general, be accurately estimated from a few samples. MSM 
seems to be more robust, yielding better results than KLD, but as 
expected, it is inferior to KMSM in the majority of cases. Finally, 
MASC is again the best performer and it exhibits very high robustness 
against data re-sampling. 

Fig. 11. Video face recognition results on the Honda/UCSD database. 

TABLE III 
Video face recognition results on the Honda/UCSD database. 

Recognition rate (%) 

           MASC     MSM      KMSM     KLD 
r = 4      100      84.62    87.18    84.62 
r = 6      100      84.62    87.18    79.49 
r = 8      97.44    84.62    84.62    61.54 
r = 10     97.44    87.18    84.62    66.67 
r = 12     97.44    76.92    82.05    61.54 
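The robustness claims can be checked directly against the values reported in Table III. The following short Python sketch computes, for each method, the mean recognition rate over the five re-sampling steps and the spread (maximum minus minimum). The numbers are copied from Table III; the spread as a robustness indicator is an illustrative choice made here, not a measure used in the paper.

```python
# Recognition rates (%) from Table III, for r = 4, 6, 8, 10, 12.
rates = {
    "MASC": [100.0, 100.0, 97.44, 97.44, 97.44],
    "MSM":  [84.62, 84.62, 84.62, 87.18, 76.92],
    "KMSM": [87.18, 87.18, 84.62, 84.62, 82.05],
    "KLD":  [84.62, 79.49, 61.54, 66.67, 61.54],
}

def summarize(values):
    """Return (mean, max - min) of a list of rates, rounded to two decimals."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    return round(mean, 2), round(spread, 2)

summary = {method: summarize(values) for method, values in rates.items()}
for method, (mean, spread) in summary.items():
    print(f"{method:5s} mean = {mean:6.2f}   spread = {spread:5.2f}")
```

The spread is 2.56 points for MASC, 5.13 for KMSM, 10.26 for MSM and 23.08 for KLD, which is consistent with the observation that KLD is the method most affected by r while MASC is the most stable.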

Regarding the relative performance of MASC and KMSM, we 
should finally stress that KMSM is a kernel technique that 
attempts to capture the nonlinear structure of the data by assuming 
a linear model after applying a nonlinear mapping of the data into a 
high dimensional space. Although this methodology stays generic and 
presents certain advantages, it is still not clear whether it is capable 
of capturing the individual (e.g., manifold) structure of diverse data 
sets. On the other hand, the MASC method explicitly relies on a 
graph model that may fit the manifold structure of the data much 
better. Furthermore, it provides a way to cope with the curse 
of dimensionality, since the intrinsic dimension of the manifolds is 
typically very small. We believe that graph methods have great 
potential in this field. 

D. Video-based face recognition overview 

For the sake of completeness, we briefly review in this last section 
the state of the art in video-based face recognition. Typically, one 
may distinguish between two main families of methods: those that 
are based on subspace analysis and those that are based on density 
estimation (statistical methods). The most representative methods for 
these two families are, respectively, the MSM [9], [10] and KMSM 
[12] methods and the solution based on KLD [13], which have been 
used in the experiments above. 
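To make the subspace-analysis family concrete, the following Python sketch computes the cosines of the principal angles between the subspaces spanned by two image sets, which is the quantity that subspace methods of the MSM type compare. It is a minimal sketch under stated assumptions and not the implementation of [9], [10]: image sets are assumed to be stored as matrices with one vectorized image per column, the subspace dimension `d` is a free parameter, and the function name and random test data are hypothetical. How the cosines are combined into a final similarity varies between methods and is not specified here.

```python
import numpy as np

def principal_angle_cosines(X1, X2, d):
    """Cosines of the principal angles between the d-dimensional subspaces
    spanned by the leading left singular vectors of X1 and X2."""
    U1, _, _ = np.linalg.svd(X1, full_matrices=False)
    U2, _, _ = np.linalg.svd(X2, full_matrices=False)
    Q1, Q2 = U1[:, :d], U2[:, :d]          # orthonormal bases of the two subspaces
    # The singular values of Q1^T Q2 are the cosines of the principal angles.
    s = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return np.clip(s, 0.0, 1.0)            # guard against round-off slightly above 1

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))           # hypothetical image set, 8 images of 50 pixels
B = rng.standard_normal((50, 8))           # a second, unrelated image set

print(principal_angle_cosines(A, A, 3))    # identical subspaces: cosines close to 1
print(principal_angle_cosines(A, B, 3))    # unrelated subspaces: smaller cosines
```

Because the model is a single linear subspace per image set, it cannot follow a curved manifold, which is the limitation that the kernel variant and the patch-based extensions discussed in this section try to address.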

Among the methods based on subspace analysis, we should mention 
the extension of principal angles from subspaces to nonlinear 
manifolds. A recent article [19] proposed to represent 
the facial manifold by a collection of linear patches, which are 
recovered by a non-iterative algorithm that augments the current patch 
until the linearity criterion is violated. This manifold representation 
allows the distance between manifolds to be defined as an integration of 
distances between linear patches. For comparing two linear patches, 
the authors propose a distance measure that is a mixture of 
(i) the principal angles and (ii) an exemplar-based distance. However, 
it is not clearly justified why such a mixture is needed and what 
its relative benefit is over the individual distances. Moreover, their 
proposed method requires the computation of both geodesic and 
Euclidean distances as well as setting four parameters. In 
contrast, our MASC method needs only one parameter (k) to be set 
and it requires the computation of the Euclidean distances only. Note 
finally that their method achieves results comparable to those of MASC on 
the Honda/UCSD database, but at a higher computational cost and at 
the price of tuning four parameters. 
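To illustrate why a single parameter k and Euclidean distances suffice to build a graph over image samples, the following Python sketch constructs a symmetric k-nearest-neighbor graph. It is only a sketch of a generic k-NN graph construction: the binary edge weights and the symmetrization by taking the maximum are assumptions made here for simplicity and may differ from the graph weights used by MASC, which are defined earlier in the paper. The function name `knn_graph` and the toy data are hypothetical.

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric k-NN adjacency matrix from Euclidean distances.
    Columns of X are samples; edge weights are binary (an assumption)."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    # Squared Euclidean distances between all pairs of columns.
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)
    np.fill_diagonal(D2, np.inf)               # a sample is not its own neighbor
    idx = np.argsort(D2, axis=1)[:, :k]        # indices of the k nearest neighbors
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(W, W.T)                  # keep an edge if either endpoint selects it

# Toy example: four 1-D samples forming two well-separated pairs.
X = np.array([[0.0, 1.0, 10.0, 11.0]])
print(knn_graph(X, 1))
```

In this toy example each sample is connected only to the other member of its pair, so the graph already separates the two groups without any geodesic distance computation.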



Along the same lines, the authors in [20] propose a similarity 
measure between manifolds that is a mixture of similarity between 
subspaces and similarity between local linear patches. Each individual 
similarity is based on a weighted combination of principal angles and 
those weights are learnt by AdaBoost for improved discriminative 
performance. In contrast to the previous paper [19], the linear patches 
are extracted here using mixtures of Probabilistic PCA (PPCA). 
PPCA mixture fitting is a highly non-trivial task, which requires an 
estimate of the local principal subspace dimension and it also involves 
model selection. This step is quite computationally intensive, as noted 
in [19]. 

The main limitation of statistical methods such as KLD [13] 
is the inadequacy of the Gaussianity assumption for face image 
sets; face sequences rather have a manifold structure. Moreover, the 
test video frames are not independent, so the i.i.d. assumption is 
unrealistic as well. The authors in [21] therefore extend the 
KL divergence approach by replacing the Gaussian densities with Gaussian 
Mixture Models (GMMs), which provide a more flexible method for 
density estimation. However, the KL divergence in this case cannot 
be computed in closed form, which forces the authors to resort to 
Monte Carlo simulations that are quite computationally intensive. 
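The difference between the two regimes can be sketched in Python. For two Gaussian densities the KL divergence has the standard closed form KL(N_0 || N_1) = 0.5 [ tr(S_1^{-1} S_0) + (m_1 - m_0)^T S_1^{-1} (m_1 - m_0) - d + ln(det S_1 / det S_0) ], whereas in general it has to be estimated by averaging log p_0(x) - log p_1(x) over samples x drawn from p_0. The sketch below is illustrative only and is not the implementation of [13] or [21]: the Monte Carlo estimator is applied here to Gaussians so that it can be checked against the closed form, and all function names and parameter values are hypothetical. For GMMs, the two log-densities would be replaced by mixture log-densities and no closed form would be available for comparison.

```python
import numpy as np

def kl_gaussians(m0, S0, m1, S1):
    """Closed-form KL divergence KL(N(m0, S0) || N(m1, S1))."""
    d = m0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d + logdet1 - logdet0)

def log_gaussian(x, m, S):
    """Log-density of N(m, S) evaluated at each row of x."""
    d = m.shape[0]
    diff = x - m
    _, logdet = np.linalg.slogdet(S)
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S), diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def kl_monte_carlo(m0, S0, m1, S1, n_samples, rng):
    """Monte Carlo estimate of KL(p0 || p1) using samples drawn from p0."""
    x = rng.multivariate_normal(m0, S0, size=n_samples)
    return np.mean(log_gaussian(x, m0, S0) - log_gaussian(x, m1, S1))

rng = np.random.default_rng(0)
m0, S0 = np.zeros(2), np.eye(2)
m1, S1 = np.array([1.0, 0.0]), np.eye(2)

print(kl_gaussians(m0, S0, m1, S1))                 # closed form
print(kl_monte_carlo(m0, S0, m1, S1, 20000, rng))   # sampling-based estimate
```

The closed form costs a few matrix operations, while the sampling-based estimate requires many density evaluations per pair of image sets and is only approximate, which is the source of the computational burden noted above.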

Finally, there have been a few other methods that cannot be directly 
categorized in the above families of methods. The authors in [22] 
propose ensemble similarity metrics that are based on probabilistic 
distance measures, evaluated in Reproducing Kernel Hilbert spaces. 
All computations are performed under the Gaussianity assumption, 
which is unfortunately not realistic for facial manifolds. 

In [23], the authors provide a probabilistic framework for face 
recognition from image sets. They model the identity as a discrete or 
continuous random variable and they provide a statistical framework 
for estimating the identity by marginalizing over face localization, 
illumination and head pose. Illumination-invariant basis vectors are 
learnt for each (discretized) pose and the resulting subspace is used 
for representing the low dimensional vector that encodes the subject 
identity. However, the statistical framework requires the computation 
of several integrals that are numerically approximated. Also, the 
proposed method assumes that training images are available for every 
subject at each possible pose and illumination, which is hard to satisfy 
in practice. 

X. Liu and T. Chen in [24] proposed a methodology based on 
adaptive hidden Markov models for video-based face recognition. 
The temporal dynamics of each subject are learnt during training and 
subsequently used for recognition. However, the proposed approach 
assumes that the frames of the face sequence are temporally ordered, 
so it is unfortunately not applicable to the more generic problem of 
recognition from image sets. The study in [25] further investigates 
how the performance of the above approach is affected by the face 
sequence length and the image quality. 

VI. Conclusions 

In this paper we have addressed the problem of classification of 
multiple observations of the same object. We have proposed to exploit 
the specific structure of this problem in a graph-based algorithm 
inspired by label propagation. The graph-based algorithm relies on 
the smoothness assumption of the manifold in order to learn the 
unknown label matrix, under the constraint that all observations 
correspond to the same class. We have formulated this process as 
a discrete optimization problem that can be solved efficiently by a 
low complexity algorithm. 

We provide experimental results that illustrate the performance 
of the proposed solution for the classification of handwritten digits, 
for object recognition and for video-based face recognition. In the 
two latter cases, the graph-based solution outperforms state-of-the-art 
methods on three publicly available data sets. This clearly outlines 
the potential of the proposed graph-based solution that is able to 
advantageously capture the structure of image manifolds. 